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Abstract 

Capturing the dependence structure of multivariate extreme events is a major 
concern in many helds involving the management of risks stemming from mul¬ 
tiple sources, e.g. portfolio monitoring, insurance, environmental risk man¬ 
agement and anomaly detection. One convenient (nonparametric) charac¬ 
terization of extreme dependence in the framework of multivariate Extreme 
Value Theory (EVT) is the angular measure, which provides direct informa¬ 
tion about the probable ’directions’ of extremes, that is, the relative con¬ 
tribution of each feature/coordinate of the ‘largest’ observations. Modeling 
the angular measure in high dimensional problems is a major challenge for 
the multivariate analysis of rare events. The present paper proposes a novel 
methodology aiming at exhibiting a sparsity pattern within the dependence 
structure of extremes. This is achieved by estimating the amount of mass 
spread by the angular measure on representative sets of directions, corre¬ 
sponding to specihc sub-cones of This dimension reduction technique 
paves the way towards scaling up existing multivariate EVT methods. Be¬ 
yond a non-asymptotic study providing a theoretical validity framework for 
our method, we propose as a direct application a -Erst- Anomaly Detec¬ 
tion algorithm based on multivariate EVT. This algorithm builds a sparse 
‘normal prohle’ of extreme behaviours, to be confronted with new (possibly 
abnormal) extreme observations. Illustrative experimental results provide 
strong empirical evidence of the relevance of our approach. 
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1. Introduction 


1.1. Context: multivariate extreme values in large dimension 

Extreme Value Theory (EVT in abbreviated form) provides a theoretical 
basis for modeling the tails of probability distributions. In many applied 
helds where rare events may have a disastrous impact, such as hnance, insur¬ 


ance, climate, environmental risk management, network monitoring (Finken- 


stadt and Rootzen (2003); Smith (2003)) or anomaly detection (Clifton et ah 
(2011); Lee and Roberts (2008)), the information carried by extremes is cru¬ 
cial. In a multivariate context, the dependence structure of the joint tail is 
of particular interest, as it gives access e.g. to probabilities of a joint excess 
above high thresholds or to multivariate quantile regions. Also, the distri¬ 
butional structure of extremes indicates which components of a multivariate 
quantity may be simultaneously large while the others stay small, which is a 
valuable piece of information for multi-factor risk assessment or detection of 
anomalies among other -not abnormal- extreme data. 

In a multivariate ‘Peak-Over-Threshold’ setting, realizations of a d - 
dimensional random vector Y = (Yi,..., Y^) are observed and the goal pur¬ 
sued is to learn the conditional distribution of excesses, [Y | ||Y|| >r], 
above some large threshold r > 0. The dependence structure of such ex¬ 
cesses is described via the distribution of the ‘directions’ formed by the 
most extreme observations, the so-called angular measure, hereafter denoted 
by $. The latter is dehned on the positive orthant of the d — 1 dimen¬ 
sional hyper-sphere. To wit, for any region A on the unit sphere (a set 
of ‘directions’), after suitable standardization of the data (see Section |^, 
C$(A) ~ P(||Y||“^Y G A I ||Y|| > r), where C is a normalizing constant. 
Some probability mass may be spread on any sub-sphere of dimension k < d, 
the fc-faces of an hyper-cube if we use the inhnity norm, which complexihes 
inference when d is large. To hx ideas, the presence of <I>-mass on a sub¬ 
sphere of the type {maxi<j<fc Xi = 1 ; Xi > 0 {i < k) ] Xk+i = ... = Xd = 0} 
indicates that the components Yi,... ,Yk may simultaneously be large, while 
the others are small. An extensive exposition of this multivariate extreme 
setting may be found e.g. in Resnick (|1987), Beirlant et ah (2004). 
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Parametric or semi-parametric modeling and estimation of the structure 
of multivariate extremes is relatively well documented in the statistical lit- 


erature, see e.g. 

Coles and Tawn 

(1991 

); 

Fougeres et al. 

(2009 

); 

Cooley 

et al. 

(2010 ; 

Sabourin and Naveau 

(2012 

) and the references therein. In a 


non-parametric setting, there is also an abundant literature concerning con¬ 
sistency and asymptotic normality of estimators of functionals characterizing 
the extreme dependence structure, e.g. extreme value copulas or the stable 
tail dependence function (STDF), see Segers (2012), Drees and Huang (1998), 
Embrechts et ah ( 2000| , Einmahl et al. ( 2012^ 7 de Haan and Ferreira (2006). 


In many applications, it is nevertheless more convenient to work with the 
angular measure itself, as the latter gives more direct information on the 
dependence structure and is able to reflect structural simplifying properties 
{e.g. sparsity as detailed below) which would not appear in copulas or in 
the STDF. However, non-parametric modeling of the angular measure faces 
major difficulties, stemming from the potentially complexe structure of the 
latter, especially in a high dimensional setting. Further, from a theoretical 
point of view, non-parametric estimation of the angular measure has only 


been studied in the two dimensional case, in Einmahl et al. (2001) and Ein¬ 


mahl and Segers (2009), in an asymptotic framework. 


Scaling up multivariate EVT is a major challenge that one faces when con¬ 
fronted to high-dimensional learning tasks, since most multivariate extreme 
value models have been designed to handle moderate dimensional problems 
(say, of dimensionality d < 10). For larger dimensions, simplifying modeling 
choices are needed, stipulating e.g that only some pre-dehnite subgroups of 
components may be concomitantly extremes, or, on the contrary, that all of 
them must be (see e.g. Stephenson (2009) or Sabourin and Naveau (2012)). 
This curse of dimensionality can be explained, in the context of extreme 
values analysis, by the relative scarcity of extreme data, the computational 
complexity of the estimation procedure and, in the parametric case, by the 
fact that the dimension of the parameter space usually grows with that of 
the sample space. This calls for dimensionality reduction devices adapted to 
multivariate extreme values. 

In a wide range of situations, one may expect the occurrence of two 
phenomena: 

1- Only a ‘small’ number of groups of components may be concomitantly 
extreme, so that only a ‘small’ number of hyper-cubes (those corresponding 
to these subsets of indexes precisely) have non zero mass (‘small’ is relative 
to the total number of groups 2“*). 
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2- Each of these groups contains a limited number of coordinates (compared 
to the original dimensionality), so that the corresponding hyper-cubes with 
non zero mass have small dimension compared to d. 

The main purpose of this paper is to introduce a data-driven methodology for 
identifying such faces, so as to reduce the dimensionality of the problem and 
thus to learn a sparse representation of extreme behaviors. In case hypothesis 
2- is not fulhlled, such a sparse ‘prohle’ can still be learned, but looses the 
low dimensional property of its supporting hyper-cubes. 

One major issue is that real data generally do not concentrate on sub¬ 
spaces of zero Lebesgue measure. This is circumvented by setting to zero 
any coordinate less than a threshold e > 0, so that the corresponding ‘angle’ 
is assigned to a lower-dimensional face. 


The theoretical results stated in this paper build on the work of Goix et al. 


(2015), where non-asymptotic bounds related to the statistical performance 
of a non-parametric estimator of the STDF, another functional measure of 
the dependence structure of extremes, are established. However, even in the 
case of a sparse angular measure, the support of the STDF would not be so, 
since the latter functional is an integrated version of the former (see (2.7), 
Section]^. Also, in many applications, it is more convenient to work with the 
angular measure. Indeed, it provides direct information about the probable 
‘directions’ of extremes, that is, the relative contribution of each components 
of the ‘largest’ observations (where ‘large’ may be understood e.g. in the 
sense of the inhnity norm on the input space). We emphasize again that 
estimating these ‘probable relative contributions’ is a major concern in many 
helds involving the management of risks from multiple sources. To the best 
of our knowledge, non-parametric estimation of the angular measure has 
only been treated in the two dimensional case, in Einmahl et al. (2001) and 


Einmahl and Segers (2009), in an asymptotic framework. 


Main contributions. The present paper extends the non-asymptotic bounds 
proved in Goix et al. (2015) to the angular measure of extremes, restricted to 
a well-chosen representative class of sets, corresponding to lower-dimensional 
regions of the space. The objective is to learn a representation of the angular 
measure, rough enough to control the variance in high dimension and accu¬ 
rate enough to gain information about the ’probable directions’ of extremes. 
This yields a -Erst- non-parametric estimate of the angular measure in any 
dimension, restricted to a class of sub-cones, with a non asymptotic bound on 
the error. The representation thus obtained is exploited to detect anomalies 
among extremes. 
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The proposed algorithm is based on dimensionality reduction. We believe 
that our method can also be used as a preprocessing stage, for dimensionality 
reduction purpose, before proceeding with a parametric or semi-parametric 
estimation which could beneht from the structural information issued in the 
hrst step. Such applications are beyond the scope of this paper and will be 
the subject of further research. 


1.2. Application to Anomaly Detection 

Anomaly Detection (AD in short, and depending of the application do¬ 
main, outlier detection, novelty detection, deviation detection, exception 
mining) generally consists in assuming that the dataset under study con¬ 
tains a small number of anomalies, generated by distribution models that 
differ from that generating the vast majority of the data. This formulation 
motivates many statistical AD methods, based on the underlying assumption 
that anomalies occur in low probability regions of the data generating pro¬ 
cess. Here and hereafter, the term ‘normal data’ does not refer to Gaussian 
distributed data, but to not abnormal ones, i.e. data belonging to the above 
mentioned majority. Classical parametric techniques, like those developed in 


Barnett and Lewis (1994) or in Eskin (2000), assume that the normal data 


are generated by a distribution belonging to some specihc, known in advance 
parametric model. The most popular non-parametric approaches include al¬ 
gorithms based on density (level set) estimation (see e.g. Scholkopf et ah 


(2001), Scott and Nowak (2006) or Breunig et al. (1999)), on dimensi 

onality 

ecision 

reduction (c/ Shyu et al. (2003), Aggarwal and Yu (2001)) or on c 

trees (jLiu et al. ( 

2008)). One may refer to 

Hodge and Austin (2004), 

Chan- 

do la et al. (2009 

), Patcha and Park (2007 

) and Markon and Singh 

(2003) 


for excellent overviews of current research on Anomaly Detection, ad-hoc 
techniques being far too numerous to be listed here in an exhaustive man¬ 
ner. The framework we develop in this paper is non-parametric and lies at 
the intersection of support estimation, density estimation and dimensionality 
reduction: it consists in learning from training data the support of a distri¬ 
bution, that can be decomposed into sub-cones, hopefully of low dimension 
each and to which some mass is assigned, according to empirical versions of 
probability measures on extreme regions. 

EVT has been intensively used in AD in the one-dimensional situation. 


see for instance Roberts (1999), Roberts (2000), Clifton et ah (2011), Clifton 


et al. (2008), Lee and Roberts (2008). In the multivariate setup, however, 
there is -to the best of our knowledge- no anomaly detection method relying 
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on multivariate EVT. Until now, the multidimensional case has only been 
tackled by means of extreme value statistics based on univariate EVT. The 
major reason is the difficulty to scale up existing multivariate EVT models 
with the dimensionality. In the present paper we bridge the gap between 
the practice of AD and multivariate EVT by proposing a method which is 
able to learn a sparse ‘normal prohle’ of multivariate extremes and, as such, 
may be implemented to improve the accuracy of any usual AD algorithm. 
Experimental results show that this method significantly improves the per¬ 
formance in extreme regions, as the risk is taken not to uniformly predict 
as abnormal the most extremal observations, but to learn their dependence 
structure. These improvements may typically be useful in applications where 
the cost of false positive errors [i.e. false alarms) is very high {e.g. predictive 
maintenance in aeronautics). 

The structure of the paper is as follows. The whys and wherefores of 
multivariate EVT are explained in the following Section]^ A non-parametric 
estimator of the subfaces’ mass is introduced in Section the accuracy of 
which is investigated by establishing hnite sample error bounds relying on VC 
inequalities tailored to low probability regions. An application to Anomaly 
Detection is proposed in Section where some background on AD is pro¬ 
vided, followed by a novel AD algorithm which relies on the above mentioned 
non-parametric estimator. Experiments on both simulated and real data are 
performed in Section Technical details are deferred to the Appendix sec¬ 
tion. 

2. Multivariate EVT Framework and Problem Statement 

Extreme Value Theory (EVT) develops models for learning the unusual 
rather than the usual, in order to provide a reasonable assessment of the 
probability of occurrence of rare events. Such models are widely used in helds 
involving risk management such as Finance, Insurance, Operation Research, 
Telecommunication or Environmental Sciences for instance. For clarity, we 
start off with recalling some key notions pertaining to (multivariate) EVT, 
that shall be involved in the formulation of the problem next stated and in 
its subsequent analysis. 

2.1. Notations 

Throughout the paper, bold symbols refer to multivariate quantities, and 
for m G M U {c)o}, m denotes the vector (m, ...,m). Also, comparison 
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operators between two vectors (or between a vector and a real number) are 
understood component-wise, i.e. ‘x < z’ means ''Xj < Zj for all ^ < J <d' 
and for any real number T, ‘x < T’ means < T for all 1 < j < d’. We 
denote by [mJ the integer part of any real number u, by = max( 0 , u) 
its positive part and by 6^ the Dirac mass at any point a G For uni¬ 
dimensional random variables Yi,..., Y(i) < ... < Y(„) denote their order 
statistics. 


2.2. Background on (multivariate) Extreme Value Theory 

In the univariate case, EVT essentially consists in modeling the distri¬ 
bution of the maxima {resp. the upper tail of the r.v. under study) as a 
generalized extreme value distribution, namely an element of the Gumbel, 
Frechet or Weibull parametric families {resp. by a generalized Pareto dis¬ 
tribution). It plays a crucial role in risk monitoring: consider the (1 — p)*^ 
quantile of the distribution F of a r.v. X, for a given exceedance probability 
p, that is Xp = inf{x G M, P(X > x) < p}. For moderate values of p, a natu¬ 
ral empirical estimate is Xp^n = inf{x G M, ^{Xi>x} < p}- However, 

if p is very small, the hnite sample Xi, ...,Xn carries insufficient infor¬ 
mation and the empirical quantile Xp^n becomes unreliable. That is where 
EVT comes into play by providing parametric estimates of large quantiles: 
whereas statistical inference often involves sample means and the Central 
Limit Theorem, EVT handles phenomena whose behavior is not ruled by 
an ‘averaging effect’. The focus is on the sample maximum rather than the 
mean. The primal assumption is the existence of two sequences {a„,n > 1} 
and {bn,n > 1 }, the a„’s being positive, and a non-degenerate distribution 
function G such that 

limnPf ^—— > X) = —logG(x) ( 2 . 1 ) 

n^cx \ On J 


for all continuity points x G M of G. If this assumption is fulhlled - it is the 
case for most textbook distributions ~ then F is said to lie in the domain 
of attraction of G: F G DA{G). The tail behavior of F is then essentially 
characterized by G, which is proved to be - up to re-scaling - of the type 
G(x) = exp(—(1 -|- 7 x)“^/'^) for 1 -|- 7 X > 0, 7 G M, setting by convention 
(1 -l- 7 x)“^/'’' = e~^ for 7 = 0. The sign of 7 controls the shape of the tail and 
various estimators of the re-scaling sequence and of the shape index 7 as well 
have been studied in great detail, see e.g. Dekkers et al. (1989), Einmahl 


et al. (2009), Hill (1975), Smith (1987), Beirlant et al. (1996). 
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Extensions to the multivariate setting are well understood from a prob¬ 
abilistic point of view, but far from obvious from a statistical perspective. 
Indeed, the tail dependence structure, ruling the possible simultaneous oc¬ 
currence of large observations in several directions, has no hnite-dimensional 
parametrization. 

The analogue of (2.1) for a d-dimensional r.v.X = (X^, ..., X'^) with 
distribution F(x) := P(Xi < xi,... ,Xd < Xd), namely F G DA(G) stipu¬ 
lates the existence of two sequences {a„,n > 1 } and {b„,n > 1 } in M'^, the 
a„’s being positive, and a non-degenerate distribution function G such that 


lim n P 

n^oo 


X^-bi 


> Xi or 


X'^-bt 


or 


> Xd] = -logG(x) (2.2) 


for all continuity points x G of G. This clearly implies that the margins 
Gi(a;i),..., Gd{xd) are univariate extreme value distributions, namely of the 
type Gj{x) = exp (—(1 -(-Also, denoting by Fi, ..., Fd the 


marginal distributions of F, Assumption (2.2) implies marginal convergence: 
Fi G DA{Gi) for i = 1, ..., n. To understand the structure of the limit G 
and dispose of the unknown sequences (a,^, b„) (which are entirely determined 
by the marginal distributions F^’s), it is convenient to work with marginally 
standardized variables, that is, to separate the margins from the dependence 
structure in the description of the joint distribution of X. Consider the 
standardized variables = 1/(1 — Fj{X^)) and V = ..., V^). In 

fact (see Proposition 5.10 in Resnick| (1987)), Assumption (2.2) is equivalent 
to marginal convergences Fj G DA{Gj) as in (2.1), together with standard 


multivariate regular variation of V’s distribution, which means existence of 
a limit measure /i on [ 0 , cxd]'^ \ {0 } such that 


n P — > Ui or ■ ■ ■ or — > Vd 
n n 


G/i([0,v]'^), (2.3) 


where [0,v] := [ 0 , Ui] X ■ ■ ■ X [0, Vd\- Thus, the variable V satishes ( |2.2[ ) 
with a„ = n = (n, ..., n), b„ = 0 = (0, ..., 0). The dependence structure 
of the limit G in ( |2.2 ) can be expressed by means of the so-termed exponent 
measure p: 


- logG(x) = /i 


^ logGi(xi)’‘■ ■ ’ logGrf(xrf)) 



















The latter is finite on sets bounded away from 0 and has the homogeneity 
property : fi{t ■) = ■). Observe in addition that, due to the standard¬ 

ization chosen (with ‘nearly’ Pareto margins), the support of fi is included in 
[0, 1]'’. To wit, the measure /i should be viewed, up to a a normalizing fac¬ 
tor, as the asymptotic distribution of V in extreme regions. For any borelian 
subset A bounded away from 0 on which /r is continuous, we have 


tF{VetA) - > fi{A). (2.4) 

t^OO 


Using the homogeneity property fi(t ■) = ■), one may show that fi can 

be decomposed into a radial component and an angular component $, which 
are independent from each other (see e.g. de Haan and Resnick ( 1977| )). 
Indeed, for all v = (ui, ...,Vd) E M'’*, set 


I ^(v) 

[©(v) 



d 

max Vi, 
2 = 1 


Vl Vd \ 

R(v)’-’R(v)y' 




d-l 


(2.5) 


where is the positive orthant of the unit sphere in for the infinity 
norm. Define the spectral measure (also called angular measure) by <h(i?) = 
/i({v ; i?(v) > l,0(v) G B}). Then, for every B C 

/r{v : R(v) > z,0(v) G R} = ^-^$(5) . (2.6) 


In a nutshell, there is a one-to-one correspondence between the exponent 
measure /i and the angular measure <I>, both of them can be used to charac¬ 
terize the asymptotic tail dependence of the distribution F (as soon as the 
margins Fj are known), since 




'e&si 


max OjXj d<I)(0), 

d-l j 


(2.7) 


this equality being obtained from the change of variable (2.5) , see e.g. Propo¬ 


sition 5.11 in Resnick (1987). Recall that here and beyond, operators on 
vectors are understood component-wise, so that = (x7\ ..., x)^). The 
angular measure can be seen as the asymptotic conditional distribution of 
the ‘angle’ 0 given that the radius R is large, up to the normalizing constant 
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$(S'^ ^). Indeed, dropping the dependence on V for convenience, we have 
for any continuity set A of <I>, 


P(0 e A \ R> r) 


rF{QeA,R>r) _ $(A) 

rP(i? > r) r^oo <I)(S'^^) 


( 2 . 8 ) 


The choice of the marginal standardization is somewhat arbitrary and al¬ 
ternative standardizations lead to different limits. Another common choice 
consists in considering ‘nearly uniform’ variables (namely, uniform variables 
when the margins are continuous): dehning \J hy = 1 — Fj{X^) for 
j G {1,... ,d}, Condition (2.3) is equivalent to each of the following condi¬ 
tions: 


• U has ‘inverse multivariate regular variation’ with limit measure A( ■) 

:= namely, for every measurable set A bounded away from 

-|-oo which is a continuity set of A, 

tF(Uet-^A) -^ A(A) =/x(A-^), (2.9) 

' ' t—t'OO 

where A~^ = {u G • • • y'^d^) ^ limit measure A is 

hnite on sets bounded away from {-|-oo}. 

• The stable tail dependence function (STDF) dehned for x G [0, oo], x 7 ^ 
00 by 

/(x) = lirnf^P f Xi or ... or < txd) = (2.10) 

exists. 


2.3. Statement of the Statistical Problem 

The focus of this work is on the dependence structure in extreme regions 
of a random vector X in a multivariate domain of attraction (see (2.1)). 
This asymptotic dependence is fully described by the exponent measure fi, 
or equivalently by the spectral measure <1>. The goal of this paper is to 
infer a meaningful (possibly sparse) summary of the latter. As shall be 
seen below, since the support of /x can be naturally partitioned in a specihc 
and interpretable manner, this boils down to accurately recovering the mass 
spread on each element of the partition. In order to formulate this approach 
rigorously, additional dehnitions are required. 
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Truncated cones. For any non empty snbset of featnres a C {1, ..., d}, 
consider the trnncated cone (see Fig. 

= {v > 0, ||v||oo > 1, Vj > 0 for j e a, Vj = 0 for j ^ a}. (2.11) 

The corresponding snbset of the sphere is 

fla = {x G : Xj > 0 for i G a , Xi = 0 for i ^ a} = fl Ca, 

and we clearly have fi{Ca) = d’(f^a) for any 0 7 ^ a C {1, ..., d}. The collec¬ 
tion {Ca ■ 07 ^a!C{l, ..., d}} forming a partition of the trnncated positive 
orthant \ [0,1], one may natnrally decompose the exponent measnre as 

h= (2.12) 

where each component Ha is concentrated on the nntrnncated cone corre¬ 
sponding to Ca- Similarly, the fla’s forming a partition of we have 

«> = E . 

07^oC{l,...,rf} 

where denotes the restriction of $ to Hq for all 0 7 ^ a C {1, ..., d}. 
The fact that mass is spread on Ca indicates that conditioned npon the event 
‘/?(V) is large’ (he. an excess of a large radial threshold), the components 
V^{j G a) may be simnltaneonsly large while the other {j ^ a) are 
small, with positive probability. Each index snbset a thns dehnes a specihc 
direction in the tail region. 

However this interpretation shonld be handled with care, since for a 7 ^ 
{ 1 ,..., d}, if fi{Ca) > 0 , then Ca is not a continnity set of /i (it has empty 
interior), nor IIq, is a continnity set of <h. Thns, the qnantity tP(V G tCa) does 
not necessarily converge to p(Cq) as t ^ -f-oo. Actnally, if F is continnons, 
we have P(V G tCa) = 0 for any f > 0. However, consider for e > 0 the 
e-thickened rectangles 

Ra = {v > 0, ||v||oo > 1, Vj > e for j G a, vj < e for j ^ a}, (2.13) 

Since the bonndaries of the sets Ra are disjoint, only a conntable nnmber of 
them may be discontinnity sets of p. Hence, the threshold e may be chosen 
arbitrarily small in snch a way that Ra is a continnity set of fi. The resnlt 
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stated below shows that nonzero mass on is the same as nonzero mass on 
for e arbitrarily small. 


Cl,2,3 



Figure 1: Truncated cones in 3D Figure 2: Truncated e-rectangles in 2D 

Lemma 1. For any non empty index subset 0 7^ a C {1,, d}, the exponent 
measure of Ca is 

e-S>0 

Proof. First consider the case a = {1,..., d}. Then Rifs forms an increasing 
sequence of sets as e decreases and Cq, = = Ue>o,eeQ Rl- The result follows 

from the ‘continuity from below’ property of the measure fi. Now, for e > 0 
and a C {1, ..., d}, consider the sets 

= {x e :Wj e a : Xj > e}, 

Nf = {x G E a : Xj > e,3j ^ a : Xj > e}, 

so that Nf C Of and Rl = Off Nf. Observe also that = 0° \ Nf. Thus, 
ti{Rl) = /i(Oa) — cind yi{Ca) = P'iOff) — so that it is sufficient 

to show that 


KK) = and /i(0°) = hin/i(0^). 

Notice that the iV^’s and the 0^’s form two increasing sequences of sets 
(when e decreases), and that Nf = U.>o,6eQ^a> = Ue>o,.eQ<^«- This 

proves the desired result. □ 

We may now make precise the above heuristic interpretation of the quan¬ 
tities the vector Ai = {fi{Ca) ■ 0 7^ a C {1, ..., d}} asymptotically 

describes the dependence structure of the extremal observations. Indeed, by 
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Lemma and the discussion above, e may be chosen such that is a con¬ 
tinuity set of /i, while is arbitrarily close to ^{Ca)- Then, using the 

characterization (2.4) of /r, the following asymptotic identity holds true: 

lim tP (||V||oo >t,V^> et {j e a), < et {j ^ a)) = /u(R"J 

t—>oo 

- /i(Ca). 


Remark 1. In terms of conditional probabilities; denoting R = ||T(X)||, 
where T is the standardization map X i—)■ V, we have 

rP(V G rRD ^ /i(R^) 


P(T(X) erRl^ \ R>r) = 


rP(V G r([0,1] 


/i([0,l] 


as in ( |2.8[ ) . In other terms, 

P (y^ > er {j G a), < er {j ^ a) 


|v|u>r) — >cyR^^) 

r^oo 

~ C/i(Ca), 


where C = l/<h(S'^ = l//r([0,1]'^). This clarifies the meaning of ‘large’ 

and ‘small’ in the heuristic explanation given above. 

Problem statement. As explained above, our goal is to describe the depen¬ 
dence on extreme regions by investigating the structure of /i (or, equivalently, 
that of $). More precisely, the aim is twofold. First, recover a rough approx¬ 
imation of the support of <F based on the partition a C {1,..., d}, a 7^ 
0 }, that is, determine which f2„’s have nonzero mass, or equivalently, which 
ys {resp. <Fa’s) are nonzero. This support estimation is potentially sparse 
(if a small number of (2^ have non-zero mass) and possibly low-dimensional 
(if the dimension of the sub-cones ffa with non-zero mass is low). The second 
objective is to investigate how the exponent measure fi spreads its mass on 
the Ca’s, the theoretical quantity fi{Ca) indicating to which extent extreme 
observations may occur in the ‘direction’ a for 0 7^ a C {1, ..., d}. These 
two goals are achieved using empirical versions of the angular measure de- 
hned in Section 3.1, evaluated on the e-thickened rectangles Formally, 
we wish to recover the (2“* — l)-dimensional unknown vector 


from Xi, 


X, 


\\M 


M = {/i(C) : 0 ^ « C {1, . . . , d}} 

F and build an estimator M. such that 
Al||oo= sup \M{a)-yCo)\ 

0^oC{l, ..., d} 


(2.14) 


i.i.d. 

rsj 
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is small with large probability. In view of Lemma (biased) estimates of 
A^’s components are built from an empirical version of the exponent measure, 
evaluated on the e-thickened rectangles (see Section [3d(| below). As a by¬ 
product, one obtains an estimate of the support of the limit measure p, 

U 

a: A4{a)>0 

The results stated in the next section are non-asymptotic and sharp bounds 
are given by means of VC inequalities tailored to low probability regions. 


2.4- Regularity Assumptions 

Beyond the existence of the limit measure /i {i.e. multivariate regular 
variation of V’s distribution, see (2.3)), and thus, existence of an angular 
measure $ (see (2.6)), three additional assumptions are made, which are 
natural when estimation of the support of a distribution is considered. 


Assumption 1. The margins of IK. have continuous c.d.f., namely Fj, 1 < 
j < d is continuous. 


Assumption is widely used in the context of non-parametric estimation of 
the dependence structure (see e.g. Einmahl and Segers (2009)): it ensures 
that the transformed variables = (1 — Fj(V'^))“^ (resp. = 1 — Fj{X^)) 
have indeed a standard Pareto distribution, P(V-^ > x) = 1/x, x > 1 {resp. 
the V-^’s are uniform variables). 


For any non empty subset a of {1, ..., d}, one denotes by dxa the 
Lebesgue measure on Ca and write dxa = dxjj ... dx^, when a = {A, • • •, A}- 
For convenience, we also write dxQ,\j instead of da;Q,\{j}. 


Assumption 2. Each component /Iq, of (2.12) is absolutely continuous w.r.t. 


Lebesgue measure dx^ on Ca. 


Assumption has a very convenient consequence regarding $: the fact that 
the exponent measure fv spreads no mass on subsets of the form {x : ||x||oo > 
= ••• = Xi^ 7 ^ 0} with r > 2, implies that the spectral measure d* 
spreads no mass on edges {x : ||x||oo = 1, = • • • = = 1} with r > 2 . 

This is summarized by the following result. 


Lemma 2. Under Assumption^ the following assertions holds true. 
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• ^ is concentrated on the (disjoint) edges 

^a,io = {x ; ||x||oo = 1, XiQ = l, 0 < Xi < 1 foriea\ {zq} 

Xi = 0 for i ^ a } 

for io G 0 7 ^ a C {1, ..., d}. 

• The restriction of $ to Tla,io is absolutely continuous w.r.t. the 
Lebesgue measure dXa\jQ on the cube’s edges, whenever |a| > 2. 

Proof. The first assertion straightforwardly results from the discussion above. 
Turning to the second point, consider any measurable set D C such 
that J^dXa\io = 0. Then the induced truncated cone D = {v : ||v||oo > 

Ijv/llviloo e D} satisfies Jfjdxa = 0 and belongs to Ca- Thus, by virtue of 
Assumption]^ = $„(£>) = fia{D) = 0. □ 

It follows from Lemma that the angular measure $ decomposes as $ = 
Xla there exist densities , |a| > 2 , zq G a, such 

that for all B C hla, |a| >2, 

r d<h ■ 

4(B) = 4„(B) = Y1 / (2.16) 

In order to formulate the next assumption, for \fd\ >2, we set 

Mji = sup sup ^ (2-16) 

iS/S dx^yj 

Assumption 3. ("Sparse Support^ The angular density is uniformly bounded 
on (V|/3| > 2, Mi 3 < oo), and there exists a constant M > 0, such that 
we have X]|/ 3 |> 2 ^’ where the sum is over subsets (3 of {1,... ,d} which 
contain at least two elements. 


Remark 2. The constant M is problem dependent. However, in the case 
where our representation A4 defined in (2.14) is the most informative about 
the angular measure, that is, when the density of is constant on PLa, we 
have M < d: Indeed, in such a case, M < X]|/ 3|>2 ~ X]|/ 3 |> 2 — 

< /i([0,1]'^). The eguality inside the last expression comes from 
the fact that the Lebesgue measure of a sub-sphere hlo, is \a\, for |q!| > 2. 
Indeed, using the notations defined in Lemma^ Lla = Uioea^«.*o' 
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the edges ^a,io being unit hypercube. Now, /i([0,1]'^) < p,{{v, 3j, Vj > 1} < 
dp,{{v, vi > 1})) < d. 

Note that the summation J2\i3\>2^h\f^\ smaller than d despite the (poten¬ 
tially large) factors |/3|. Considering Yli\p\> 2 ^P reasonable: in par¬ 

ticular, M will be small when only few Cla’s have non-zero ^-mass, namely 
when the representation vector M. defined in (2.14) is sparse. 


Assumption is naturally involved in the derivation of upper bounds on 
the error made when approximating yi{Ca) by the empirical counterpart of 
lJi{Ra)- The estimation error bound derived in Section depends on the 
sparsity constant M. 


3. A non-parametric estimator of the subcones’ mass : definition 
and preliminary resnlts 

In this section, an estimator Ad (a) of each of the sub-cones’ mass ^(Cq), 
0 7 ^ a C {1,..., d}, is proposed, based on observations Xi,...., X„, i.i.d. 
copies of X ~ F. Bounds on the error ||Ad — A4||oo are established. In the 
remaining of this paper, we work under Assumption (continuous margins, 
see Section |2.4[ ) . Assumptions and are not necessary to prove a prelimi¬ 
nary result on a class of rectangles (Propositionand Corollary]^. However, 
they are required to bound the bias induced by the tolerance parameter e (in 
Lemma 1^ Proposition and in the main result. Theorem [^. 

3.1. A natural empirical version of p. 

Since the marginal distributions Fj are unknown, we classically consider 
the empirical counterparts of the Vj’s, Vj = {Vf,... ,Vf), 1 < i < n, as 
standardized variables obtained from a rank transformation (instead of a 
probability integral transformation), 

V. = ((1 - it W))-‘) 

V / l<j<d 

where Fj{x) = (1/n) '^{xNx}- denote by T {resp. T) the standard¬ 

ization (resp. the empirical standardization). 


T(x) = 




and T(x) = 


i<j<d 


1 - Fj{xi) 


(3.1) 


l<j<d 
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The empirical probability distribution of the rank-transformed data is then 
given by 


P, 




2 = 1 


Since for a p-continuity set A bounded away from 0, f P (V G tA) —>■ /j,(A) 


as t —>■ oc, see (2.4), a natural empirical version of /i is defined as 


f^n{A) = -Fn{-A) = 


2=1 


(3.2) 


Here and throughout, we place ourselves in the asymptotic setting stipulating 
that k = k{n) > 0 is such that k ^ oo and k = o{n) as n —)■ cx). The ratio 
n/k plays the role of a large radial threshold. Note that this estimator is 
commonly used in the held of non-parametric estimation of the dependence 
structure, see e.g. 


Einmahl and Segers (2009) 


3.2. Accounting for the non asymptotic nature of data: e-thickening. 

Since the cones Ca have zero Lebesgue measure, and since, under As¬ 
sumption the margins are continuous, the cones are not likely to receive 
any empirical mass, so that simply counting points in |Cq, is not an option: 
with probability one, only the largest dimensional cone (the central one, cor¬ 
responding to a = {!,..., d}) will be hit. In view of Subsection 2.3 and 


Lemma [T| it is natural to introduce a tolerance parameter e > 0 and to ap¬ 
proximate the asymptotic mass of with the non-asymptotic mass of R%. 
We thus dehne the non-parametric estimator M{a) of p(Cq 


as 


M{a) = UniRo 


^ a G {1 ,..., d}. 


(3.3) 


as 


Evaluating M.{a) boils down (see (3.2)) to counting points in {n/k)Rl 
illustrated in Figure 3 The estimate Ai(a) is thus a (voluntarily e-biased) 
natural estimator of <h(f2Q,) = pi{Ca). 

The coefficients (Ad(a))Qc{i,...,d} related to the cones Ca constitute a sum¬ 
mary representation of the dependence structure. This representation is 
sparse as soon as the are positive only for a few groups of features a 
(compared to the total number of groups or sub-cones, 2“^ namely). It is is 
low-dimensional as soon as each of these groups a is of small cardinality, or 
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n 

V 


equivalently the corresponding sub-cones are low-dimensional compared with 
d. _ 

In fact, Ai{a) is (up to a normalizing constant) an empirical version of 
the conditional probability that T(X) belongs to the rectangle given 
that ||T(X)|| exceeds a large threshold r. Indeed, as explained in Remark]^ 

M(a) = lim /i([0,1]") P(T(X) e | ||T(X)|| > r). (3.4) 

r^oo 


The remaining of this section is devoted to obtaining non-asymptotic um- 
per bounds on the error 11| |oo- The main result is stated in Theorem 1 
Before all, notice that the error may be obviously decomposed as the sum o 
a stochastic term and a bias term inherent to the e-thickening approach: 


||A^-A^||oo = max |/i„(R^) -/i(C„) I 


< max Ip -/i„|(R^) -h max|p(R^)-p(C„)| 


(3.5) 


Here and beyond, for notational convenience, we simply denotes ‘a’ for ‘a 
non empty subset of {1, ..., d}’. The main steps of the argument leading to 
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Theorem are as follows. First, obtain a uniform upper bound on the error 
|/i„ — /i| restricted to a well chosen VC class of rectangles (Subsection 3.3), 


and deduce an uniform bound on |/r„ — /i|(i?^) (Subsection 3.4). Finally, 


using the regularity assumptions (Assumptionand Assumption]^, bound 
the difference |/i(i?Q) — fi{Ca)\ (Subsection |3.5[ ). 

3.3. Preliminaries: uniform approximation over a VC-class of rectangles 


This subsection builds on the theory developed in Goix et ah (2015), 


where a non-asymptotic bound is stated on the estimation of the stable tail 


dependence function (STDF) dehned in (2.10). The STDF I is related to the 
class of sets of the form [0, v]'^ (or [u, cx)]'^ depending on which standardization 
is used), and an equivalent dehnition is 

/(x) := lim tF{t~^x.) = p([0,x“^]'^) (3.6) 

t^OO 

with F’(x) = (1 — F')((l — ..., (1 — Fd)^{xd)). Here the notation 

(1 — Fj)^{xj) denotes the quantity sup{j/ : 1 — Fj{y) > Xj}. Recall that the 
marginally uniform variable U is dehned by = 1 — Fj{X^) (1 < j < d). 
Then in terms of standardized variables UC 


F(x) = P(^lJ{f/^' < Xj}j = P(U G [x,oo[") = P(V G [0,x-^]"). (3.7) 

i=i 

A natural estimator of I is its empirical version dehned as follows, see 


Huang (1992), Qi (1997), Drees and Huang (1998), Einmahl et ah (2006), 
Goix et al. ( 2015| ): 




2 = 1 


Vx^>X} or ... or X'^>Xf ,, 

t I— (n—[fccci J+1) (n—+ 1) J 


(3.8) 


The expression is indeed suggested by the dehnition of I in (3.6), with all 


distribution functions and univariate quantiles replaced by their empirical 
counterparts, and with t replaced by n/k. The following lemma allows to 
derive alternative expressions for the empirical version of the STDF. 

Lemma 3. Consider the rank transformed variables Uj = (V,)”^ = (1 — 
Fj{X-))i<j<d for i = 1,..., n. Then, for (i, j) G {1,..., n} x {1,..., d}, with 
probability one, 

k 


n 


D/ < -x-^ ^ Vf > yXi ^ Xf > 
n ^ k 


^ W < Ul 


n"'] ^ [fcx. ’■J+l) * ([fcx. ^J) 


19 






































The proof of Lemma is standard and is provided in Appendix A for 
completeness. By Lemma the following alternative expression of /n(x) 
holds trne: 


/n(x) = - 


n 


i=l 


{U} < r/1 or 

I- ^ - ([fca:i]) 


or < Uf,, ~ hn ([0)^ • (3-9) 

4 — {[kx^])i ^ ^ 


Thns, bonnding the error |/i„—/i|([0, x is the same as bonnding /|(x). 


Asymptotic properties of this empirical connterpart have been stndied in 


Hnangf ( 

1992) 

, Drees and Hnang 

(1998) 

, Embrechts et ah 

(2000 

) and c 

e Haan 

and Ferreiraj ( 

2006 

) in the bivariate case, and 

Qi( 

199' 

f),E 

inmahl et ah 

(2012). 

in the general mnltivariate case 

. In 

Boix et a 

1 . (5 

>015 

), a non-asymptotic 


bonnd is established on the maximal deviation 


snp |/(x)-/„(x)| 
o<x<r 

for a hxed T > 0, or eqnivalently on 

snp |p([0,x]'=)-p„([0,x]")|. 

1/T<x 

The exponent measnre /i is indeed easier to deal with when restricted to the 
class of sets of the form [0,x]'^, which is fairly simple in the sense that it has 
hnite VC dimension. 

In the present work, an important step is to bonnd the error on the class 
of e-thickened rectangles This is achieved by nsing a more general class 
R{'x,z,a, (3), which inclndes (contrary to the collection of sets [0,x]'^) the 
. This flexible class is dehned by 

R{x,z,a,(3) = |y e [0, cx)]'^, yj>Xj for j e a, 

Vj < V for j e /? X, z e [0, cx)]'^. 

(3.10) 


Thns, 


(i?(x,z,a,/3)) 



2=1 


for j&a and V? < ^Xj for je/3} 
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Then, define the functional (which plays the same role as the STDF) 
as follows: for x G [0 ,cxd]'^ \ {oo}, z G [0, cxd]'^, a C d} \ 0 and 

/3 C {1,..., d}, let 


9a,/3(x,z) 

Fa,/ 3 (x,z) 


lim tFa,,3(i ^x,t ^z), with 

t^OO ’ 


P 


{w < 


X-i 


for j G a} {W > Zj 


(3.11) 

for j G /?} . 

(3.12) 


Notic e th at F^ six, z ) is an extension of the non-asymptotic approximation 
F in (3.6). By (3.11) and (3.12), we have 

5fa^/3(x, z) = lim tP < t~^Xj for j G a} [W > t~^Zj for j G /?} 


t^OO 

= lim fP Tv G fi?(x“^, z~^, a, /3)1 , 


so that using (2.4), 


9 o,;i(x,z) = ;i(|fl(x \z /?)]) 


(3.13) 


The following lemma makes the relation between Qa^is and the angular 
measure <1> explicit. Its proof is given in [Appendix A 


Lemma 4. The function can be represented as follows: 

gaAx, z) = / A WjXj - y WjZj <h(dw) , 

A-i Viea J ^ 

where u Av = min{M,n}, uV v = max{M, n} and u+ = max{M, 0} for any 
{u,v) G M^. Thus, ga,y is homogeneous and satisfies 

|fi'a,/3(x, z) - 5 f„,/ 3 (x', z)\ < ^ \Xj - x'-1 + ^ \Zj - z'^\ , 

tea je/3 

Remark 3. Lemma^ shows that the functional ga^p, which plays the same 
role as a the STDF, enjoys a Lipschitz property. 

We now define the empirical counterpart of (mimicking that of the 
empirical STDF in (|3.8[) ) by 


9n,a,p{x,z) — 


2=1 


(n-Lfex,.J+i) for jea and for ie/3} ' 

(3.14) 
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As it is the case for the empirical STDF (see (3.9)), gn,a,i 3 has an alternative 
expression 


5'n,a,/3(x,z) = ^ 5^1 


2 = 1 


{Ui < for j&a and Uf > for je/3} 


= Hn (i?(x \z \a, /?)) , 


(3.15) 


where the last eqnality comes from the eqnivalence Vl > -v^ Ul < 

k ^i=i ""Vief(-)’ 


U^,,, (Lemma 

{\kx- j) ^ 


and from the expression /i„( ■) = | '^ 1=1 


nition (3.2). 

The proposition below extends the resnlt of Goix et ah (2015), by deriving 
an analogue upper bound on the maximal deviation 

max sup | 5 („,^(x,z) - 5 („,^,^(x,z)| , 

“iP 0<x,z<T 


or equivalently on 


max sup |p(i?(x, z, a,/d)) —/i„(i?(x, z, a,/d))| . 

1/T<x,z 

Here and beyond we simply denote ‘a, /d’ for ‘a non-empty subset of {1,..., d}\ 
0 and /d subset of {1,... ,dy. We also recall that comparison operators be¬ 
tween two vectors (or between a vector and a real number) are understood 
component-wise, i.e. ‘x < z’ means ‘Xj < Zj for all 1 < j < d’ and for any 
real number T, ‘x < T’ means ‘Xj < T for all 1 < J < d\ 

Proposition 1. Let T > |(^^ + 1), and 6 > e~^. Then there is a universal 
constant C, such that for each n > 0, with probability at least 1 — d, 


max sup | 5 („,„,^(x,z) - 5 (o,^(x,z)| < Cdxf^ 

«./3 o<x.z<r V fc d 


(3.16) 


• 4 - max sup 

0<x,z<2T 


n~kk 

^i^«,/3(-X, -z) -^«,^(x,z) 


The second term on the right hand side of the inequality is an asymptotic 
bias term which goes to as n ^ oo (see Remark 12). 


The proof follows the same lines as that of Theorem 6 in Goix et ah (2015) 
and is detailed in Appendix A Here is the main argument. 
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The empirical estimator is based on the empirical measure of ‘extreme’ 
regions, which are hit only with low probability. It is thus enough to bound 
maximal deviations on such low probability regions. The key consists in 
choosing an adaptive VC class which only covers the latter regions (after 
standardization to uniform margins), namely a VC class composed of sets 




In 


Goix et al. 


(2015[), VC-type inequalities 


of the kind , z“ , a, (3) 

have been established that incorporate p, the probability of hitting the class 
at all. Applying these inequalities to the particular class of rectangles gives 
the result. 


3.4- Bounding — p\{Ra) uniformly over a 

The aim of this subsection is to exploit the previously established bound 
on the deviations on rectangles, to obtain another uniform bound for |/i„ — 
/i|(i?^), for e > 0 and a C {1,..., d}. In the remainder of the paper, a 
denotes the complementary set of a in {1,..., d}. Notice that directly from 
their dehnitions (2.13) and (3.10), and i?(x, z, a,/3) are linked by: 


= R{e, e, a, a) fl [0,1]'’ = dd(e, e, a, a) \ R{e, e, a, {1,..., d}) 


where e is dehned by ej = Ij^q + elj^a for all j G {1,..., d}. Indeed, we 
have: R{e, e, a, a) fl [0,1] = R{e, e, d}). As a result, for e < 1, 


sup \pn - hi {Ra) < 2 sup |hn “ hi (-^(x, Z, «, «)) . 

e<x,z 


On the other hand, from (3.15) and (3.13) we have 


sup |hn-hl(-R(x, Z, a, a)) = sup \gn,a,ai^, z) - ga,ai^, z) \ . 

f:<x,z 0<x,z<e-l 


Then Proposition [^applies with T = 1/e and the following result holds true. 

Corollary 1. Let 0 < e < + 1)) and 5 > e Then there is a 

universal constant C, such that for each n > 0, with probability at least 1 — d, 


max sup Khn 

^ €<X,Z 


T)iK)\ < Cd 



log 


d + 3 
d 


-|- max sup 

0<x,z<2e-l 



k 

-X, 

n 



(3.17) 


da,/3(x,z) 
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3.5. Bounding — iJi{Ca)\ uniformly over a 

In this section, an upper bound on the bias induced by handling e- 
thickened rectangles is derived. As the rectangles defined in (2.13) do 

not correspond to any set of angles on the sphere we also define the 

{e,e')-thickened cones 

= {v > 0 , ||v||oo > 1 , Vj > e||v||oo for j e a, vj < e'||v||oo for j ^ a}, 

(3.18) 

which verify C C Define the corresponding (e, e')-thickened sub¬ 
sphere 

= {x G 5*^^ Xi> e for i e a , xi < d for i ^ a} = n 

(3.19) 

It is then possible to approximate rectangles by the cones and 
and then by ) in the sense that 


$(Df) = /i(Q°) < y^{Rl) < /r(0 = <I>(D°’^). 


(3.20) 


The next result (proved in Appendix A) is a preliminary step toward a 
bound on |/i(i?a) ~ /n(Ca)|- It is easier to use the absolute continuity of d* 
instead of that of /i, since the rectangles are not bounded contrary to the 
sub-spheres . 

Lemma 5. For every 0 7 ^ a C {1,..., d} and 0 < e, e' < 1/2, we have 
|$(D^’^') - *h(D^)| < M\a\^e + Mde' . 


Now, notice that 

$(D^’°) - $(D„) < - /x(C„) < «I>(D°’^) - $(D„). 

We obtain the following proposition. 

Proposition 2. For every non empty set of indices 0 7 ^ a C {1,..., d} and 
e > 0, 

|/i(/2^)-/i(C„)| <Md2e 
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3.6. Main result 

We can now state the main resnlt of the paper, revealing the accnracy of 


the estimate (3.3). 


Theorem 1. There is an universal constant C > 


n, k, e, 6 verifying 6>e^,0<e< 


0 such that for every 
the 


1/2 and 6 < + 1)) 


following ineguality holds true with probability greater than 1 — 5.' 


II-M--MII00 < Cd [ J^\og^ + Mde 


4 max 

a C 


snp 

0<x,z<^ 


n ~ k k 

-F„,o(-x, -z) - 
k n n 


Note that + 1) is smaller than 4 as soon as \ogd/k < 1/7, so that a 

snfficient condition on e is e < 1/4. The last term in the right hand side 
is a bias term which goes to zero as n —>■ 00 (see Remark 12). The term 


Mde is also a bias term, which represents the bias indnced by considering e- 
thickened rectangles. It depends linearly on the sparsity constant M dehned 
in Assnmption The valne k can be interpreted as the effective nnmber of 
observations nsed in the empirical estimate, i.e. the effective sample size for 
tail estimation. Considering classical ineqnalities in empirical process theory 
snch as VC-bonnds, it is thns no snrprise to obtain one in 0{l/\/k). Too 
large valnes of k tend to yield a large bias, whereas too small valnes of k 
yield a large variance. For a more detailed discnssion on the choice of k we 


recommend Einmahl et al. (2009). 


The proof is based on decomposition (3.5). The hrst term snp„ \Tn{R 


/i(i?^)| on the right hand side of (3.5) is bonnded using Corollary]^ while 
Proposition allows to bound the second one (bias term stemming from the 
tolerance parameter e). Introduce the notation 


bias(a, n, fc, e) = 4 sup 

0<x,z<^ 

— — e 

With probability at least 1 — 5, 


n 


.k 

n 


-Fa,s(-x, -zl 


k_ 

n 


-9c 


X, z, 


(3.21) 


V 0 7 ^ a C {1,..., d}, 

|/r„(i?^) - p(Cq,)| < Cd\j ^ ^ ^ + bias(a,?7,, fc,e)+ Md^e . 
The upper bound stated in Theorem [T] follows. 
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Remark 4. (Thresholding the estimator^ In practice, we have to deal 
with non-asymptotic noisy data, so that many Ai{a) ’s have very small values 
though the corresponding Ai{a) ’s are null. One solution is thus to define a 
threshold value, for instance a proportion p of the averagedjmass over all 
the faces a with positive mass, i.e. threshold = p\A\~^'Yl,a-^^^) ^ ~ 

{a, A4(a) > 0} . Let us define A4{a) the obtained thresholded Ai(a). Then 
the estimation error satisfies: 

il^--M||oo < ||^--M||oo + ||A?--M||oo 


<P\A\ +p\A\ ^^\M{a) - M{a)\ 

a a 

+ \\M-M\U 

< (p+i)iiAi->fiu+pi>i|-v(io.ir). 


It is outside the scope of this paper to study optimal values for p. However, 
Remark\^ writes the estimation procedure as an optimization problem, thus 
exhibiting a link between thresholding and -regularization. 


Remark 5. (Underlying risk minimization problems) Our estimate 
A4(a) can be interpreted as a solution of an empirical risk minimization 
problem inducing a conditional empirical ri^ Rn. When adding a regu¬ 
larization term to this problem, we recover M(a), the thresholded estimate. 

First recall that Ai(a) is defined for a C {1, ..., d}, a ^ tj) by M.[a) = 
1/k X]r=i ■ "4s C [0,1]'^, we may write 


M ( o .) = ffp4 \\%\\ > 1) 


1 

it Vnm-ViW > 1 ) 


n 


where the last term is the empirical expectation of Z^fiot) = condi- 

tionnaly to the event {|||Vi|| > 1}, and i\\ > 1) = ^ Er=i l^||Uli>i- 

According to Lemma for each fixed margin j, Vf > | if, and only if 
Xl > which happens for k observations exactly. Thus, 


VnHl^lW > 1 ) 

n 


n 


El 

2 = 1 


W>7 


k dk 
n’ n 
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we then have 


If we define k = k(n) G [/c, dk] such that 'Pn(|||Vi|| > 1) = 


M{a) 


k 

k 

k 

k 


1 ■G 

'Pn(JI|V.|| >1) ) 


n 

argmin^^>o V(Z„,i(a) - m^)^! ^ 

f ^ n'' ''ll — 

2 = 1 


Considering now the (2'^ — 1)-vector A4 and ||.|| 2 ,a the L'^-norm on ^, we 
immediatly have (since k{n) does not depend on a) 

, f , . 

A'l = - argmin^gg2d-i Rn{m), (3.22) 

where Rn{rn) = Y^=i q1 *liiv ll>i -empirical risk ofm, re- 

striated to extreme observations, namely to observations lK.i satisfying ||Vj|| > 
Then, up to a constant | = 0(1), Ai is solution of an empirical condi¬ 
tional risk minimization problem. Define the non-asymptotic theoretical risk 
Rn{m) for m G by 


Rnijn) = E 


m 


2,a 


|-Vii|oo>l 

n 


with Zn '■= Zn,i. Then one can show (see Appendix A) that Zn, condition¬ 
ally to the event {||^Vi|| > 1}, converges in distribution to a variable Zoo 
which is a multinomial distribution on with parameters {n = = 

tyloiY) ’ ^ {1; ■ ■ ■; '''■}) 7^ 0)- In other words. 


P(Zoo(a) = 1) 


KRl) 

M[o,i]=) 


for a// a G {1,, n}, a ^ and ^oo(^) = 1- Rhus Rn(fn) —)■ Roo{m) := 
E[||Zoo —'"i-llia]; which is the asymptotic risk. Moreover, the optimization 
problem 


min 


Roo{m) 


admits m 


( ^Q^(fi]c) ; OL C {1,..., n}, a as solution. 
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Considering the solution of the minimization problem (3.22), which hap 


pens to coincide with the definition of Ai, makes then sense if the goal is 
to estimate M. := G 7^ 0 ). As well as considering 

thresholded estimators Ai(a), since it amounts (up to a bias term) to add a 
-penalization term to the underlying optimization problem: Let us consider 


mm 


Rn { rn ) + A||mi|i,„ 


with ||m||i^a = \m{a)\ the norm on In this optimization prob¬ 

lem, only extreme observations are involved. It is a well known fact that 
solving it is eguivalent to soft-thresholding the solution of the same problem 
without the penality term - and then, up to a bias term due to the soft- 
thresholding, it boils down to setting to zero features m{a) which are less 
than some fixed threshold T(A). This is an other interpretation on threshold¬ 
ing as defined in Remark\^ 


4. Application to Anomaly Detection 


4.1. Background on AD 

What is Anomaly Detection ? From a machine learning perspective, AD 
can be considered as a specific classification task, where the usnal assnmp- 
tion in snpervised learning stipulating that the dataset contains structural 
information regarding all classes breaks down, see Roberts (1999). This typ¬ 


ically happens in the case of two highly unbalanced classes: the normal class 
is expected to regroup a large majority of the dataset, so that the very small 
number of points representing the abnormal class does not allow to learn in¬ 
formation about this class. Supervised AD consists in training the algorithm 
on a labeled (normal/abnormal) dataset including both normal and abnormal 
observations. In the semi-supervised context, only normal data are available 
for training. This is the case in applications where normal operations are 
known but intrusion/attacks/viruses are unknown and should be detected. 
In the unsupervised setup, no assumption is made on the data which consist 
in unlabeled normal and abnormal instances. In general, a method from the 
semi-supervised framework may apply to the unsupervised one, as soon as 
the number of anomalies is sufficiently weak to prevent the algorithm from 
fitting them when learning the normal behavior. Such a method should be 
robust to outlying observations. 






Extremes and Anomaly Detection. As a matter of fact, ‘extreme’ ob¬ 
servations are often more susceptible to be anomalies than others. In other 
words, extremal observations are often at the border between normal and ab¬ 
normal regions and play a very special role in this context. As the number of 
observations considered as extreme {e.g. in a Peak-over-threshold analysis) 
typically constitute less than one percent of the data, a classical AD algo¬ 
rithm would tend to systematically classify all of them as abnormal: it is not 
worth the risk (in terms of ROC or precision-recall curve for instance) trying 
to be more accurate in low probability regions without adapted tools. Also, 
new observations outside the ‘observed support’ are most often predicted as 
abnormal. However, false positives {i.e. false alarms) are very expensive in 
many applications {e.g. aircraft predictive maintenance). It is thus of primal 
interest to develop tools increasing precision {i.e. the probability of observing 
an anomaly among alarms) on such extremal regions. 

Contributions. The algorithm proposed in this paper provides a scoring 
function which ranks extreme observations according to their supposed de¬ 
gree of abnormality. This method is complementary to other AD algorithms, 
insofar as two algorithms (that described here, together with any other ap¬ 
propriate AD algorithm) may be trained on the same dataset. Afterwards, 
the input space may be divided into two regions - an extreme region and 
a non-extreme one- so that a new observation in the central region {resp. 
in the extremal region) would be classihed as abnormal or not according to 
the scoring function issued by the generic algorithm {resp. the one presented 
here). The scope of our algorithm concerns both semi-supervised and unsu¬ 
pervised problems. Undoubtedly, as it consists in learning a ‘normal’ {i.e. not 
abnormal) behavior in extremal regions, it is optimally efficient when trained 
on ‘normal’ observations only. However it also applies to unsupervised situ¬ 
ations. Indeed, it involves a non-parametric but relatively coarse estimation 
scheme which prevents from over-htting normal data or htting anomalies. As 
a consequence, this method is robust to outliers and also applies when the 
training dataset contains a (small) proportion of anomalies. 

4.2. Algorithm: Detecting Anomalies among Multivariate Extremes (DAMEX) 
The purpose of this subsection is to explain the heuristic behind the use 
of multivariate EVT for Anomaly Detection, which is in fact a natural way to 
proceed when trying to describe the dependence structure of extreme regions. 
The algorithm is thus introduced in an intuitive setup, which matches the 
theoretical framework and results obtained in sections |2] and HI The notations 
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are the same as above: X = (X^,..., X^) is a random vector in with joint 
[resp. marginal) distribution F {resp. Fj, j = 1,... ,d) and Xi,....,X„ ~ F 
is an i.i.d. sample. The hrst natural step to study the dependence between 
the margins is to standardize them, and the choice of standard Pareto 
margins (with c.d.f. x h-)■ l/x) is convenient: Consider thus the Vj’s and Vj’s 
as dehned in Section One possible strategy to investigate the dependence 
structure of extreme events is to characterize, for each subset of features 
a C ,(i}, the ‘correlation’ of these features given that one of them at 
least is large and the others are small. Formally, we associate to each such a 
a coefficient Ai{a) reffecting the degree of dependence between the features 
a. This coefficient is to be proportional to the expected number of points 
Vj above a large radial threshold (||V||oo > r), verifying V/ ‘large’ for j G a, 
while ‘small’ for j ^ a. In order to dehne the notion of ‘large’ and ‘small’, 
£x a (small) tolerance parameter 0 < e < 1. Thus, our focus is on the 
expected proportion of points ‘above a large radial threshold’ r which belong 
to the truncated rectangles dehned in (2.13). More precisely, our goal is 


to estimate the above expected proportion, when the tolerance parameter e 
goes to 0. 

The standard empirical approach -counting the number of points in the 
regions of interest- leads to estimates Xi(a) = fin{Ra) (see (3.3)), with pin 


the empirical version of pi dehned in (3.2), namely: 


_ 'Yl 

M{a) = ^^n{K) = 


(4.1) 


where we recall that P„ = (l/n) 'Y^=i empirical probability distri¬ 

bution of the rank-transformed data, and k = k{n) > 0 is such that A; —)■ oo 
and k = o(n) as n —)■ oo. The ratio n/k plays the role of a large radial 
threshold r. From our standardization choice, counting points in {n/k)R^^ 
boils down to selecting, for each feature j < d, the 'k largest values’ Xf 
among n observations. According to the nature of the extremal dependence, 
a number between k and dk of observations are selected: k in case of perfect 
dependence, dk in case of ‘independence’, which means, in the EVT frame¬ 
work, that the components may only be large one at a time. In any case, the 
number of observations considered as extreme is proportional to k, whence 
the normalizing factor 

The coefficients associated with the cones Ca constitute 

our representation of the dependence structure. This representation is sparse 
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as soon as the A4{a) are positive only for a few gronps of featnres a (com¬ 
pared with the total nnmber of gronps, or snb-cones, 2'^ — 1). It is is low¬ 
dimensional as soon as each of these gronps has moderate cardinality |q!|, i.e. 
as soon as the snb-cones with positive Ai{a) are low-dimensional relatively 
to d. 

In fact, np to a normalizing constant, A4{a) is an empirical version of the 
probability that T(X) belongs to the cone Ca, conditioned npon exceeding a 
large threshold. Indeed, for r,n and k snfficiently large, we have (Remark 
and (3.4), reminding that V = T(X)) 


M{a) ~ C'P(T(X) G | ||T(X)|| > r) 
Introdnce an ‘angnlar scoring fnnction’ 


Wr, 


(4.2) 


For each fixed (new observation) x, Wni'x.) approaches the probability that the 
random variable X belongs to the same cone as x in the transformed space. 
In short, tCn(x) is an empirical version of the probability that X and x have 
approximately the same ‘direction’. For AD, the degree of ‘abnormality’ of 
the new observation x should be related both to Wn(x) and to the uniform 
norm ||T(x)||oo (angular and radial components). More precisely, for x fixed 
such that T(x) G Consider the 'directional tail region^ induced by x, 
4lx = {y : T(y) G , ||T(y)||oo > ||T(x)||oo}. Then, if ||T(x)||oo is large 
enough, we have (using (2.6[)) that 


P(X G Ax) =P(V G ||T(x) 


\ooK) 


= P(||V||>||T(x)||) P(Vg||T(x)|UR^ 
-CPdlVll > ||T(x)||) M(a) 


|V|| > ||T(x) 


= C||T(x)i|-^m 
This yields the scoring function 


X . 


Wr. 


X : = 


l|r(x)ll< 


(4.3) 


which is thus (up to a scaling constant C) an empirical version of P(X G Ax): 
the smaller s„(x), the more abnormal the point x should be considered. 
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As an illustrative example, Figure displays the level sets of this scoring 
function, both in the transformed and the non-transformed input space, in 
the 2D situation. The data are simulated under a 2D logistic distribution 
with asymmetric parameters. 


levels sec of DBAD scoring 




Figure 4: Level sets of on simulated 2D data 


This heuristic argument explains the following algorithm, referred to as 
Detecting Anomaly with Multivariate Extremes (DAMEX in abbreviated 
form). Note that this is a slightly modihed version of the original DAMEX al- 


gorihtm empirically tested in Goix et ah (2016), where e-thickened sub-cones 


instead of e-thickened rectangles are considered. The proof is more straight¬ 
forward when considering rectangles and performance remains as good. The 
complexity is in 0{dn\ogn -|- dn) = 0{dn\ogn), where the hrst term on the 
left-hand-side comes from computing the Fj{Xl) (Step 1) by sorting the data 
{e.g. merge sort). The second one arises from Step 2. 
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Algorithm 1. (DAMEX) 

Input: parameters e > 0, k = k{n), p > 0. 

1. Standardize via marginal rank-transformation: Vj := (l/(l — 

fiW))),... 

2. Assign to each V* the cone it belongs to. 

3. Compute AA{a) from 
non-zero mass. 

4. (Optional) Set to 0 the M.{a) below some small threshold defined 
in remar w.r.t. p.—)• yields: (sparse) representation of the depen¬ 
dence structure 


(4-1) —^ yields: (small number of) cones with 


a) -. 0a C {1,..., (i}|. 


Output: Compute the scoring function given by (4-3), 
s„(x) := (l/||f(x)||oo)5^-M(a)lf(^)g^.. 


(4.4) 


Before investigating how the algorithm above empirically performs when 
applied to synthetic/real datasets, a few remarks are in order. 


Remark 6. (Interpretation of the Parameters^ In view of (f.l) 


n/k is the threshold above which the data are considered as extreme and k is 
proportional to the number of such data, a common approach in multivariate 
extremes. The tolerance parameter e accounts for the non-asymptotic nature 
of data. The smaller k, the smaller e shall be chosen. The additional angular 
mass threshold in step 4- o,cts as an additional sparsity inducing parameter. 
Note that even without this additional step (^i.e. setting p = 0, the obtained 
representation for real-world data (see Table^ is already sparse (the number 
of charges cones is significantly less than 2^). 


Remark 7. ('Choice of Parameters j A standard choice of parameters 
(e, k, p) is respectively (0.01, 0.1). However, there is no simple manner 

to choose optimally these parameters, as there is no simple way to determine 
how fast is the convergence to the (asymptotic) extreme behavior -namely 
how far in the tail appears the asymptotic dependence structure. Indeed, even 
though the first term of the error bound in Theorem is proportional, up to 
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re-scaling, to which suggests choosing e of order k the unknown 

bias term perturbs the analysis and in practice, one obtains better results with 
the values above mentioned. In a supervised or semi-supervised framework (or 
if a small labeled dataset is available) these three parameters should be chosen 


by cross-validation. In the unsupervised situation, a classical heuristic (Coles 


(200 f) ) is to choose {k,e) in a stability region of the algorithm’s output: 
the largest k (^resp. the larger e) such that when decreased, the dependence 
structure remains stable. This amounts to selecting as many data as possible 
as being extreme (^resp. in low dimensional regions), within a stability 
domain of the estimates, which exists under the primal assumption (2.1) and 
in view of Lemma 


Remark 8. (Dimension Reduction ) If the extreme dependence structure 
is low dimensional, namely concentrated on low dimensional cones Ca - or in 
other terms if only a limited number of margins can be large together - then 
most of the D ’s will be concentrated on the ’s such that |a| (the dimension 
of the cone Ca) is small; then the representation of the dependence structure 
in (f.f) is both sparse and low dimensional. 


Remark 9. (Scaling Invariance^ DAMEX produces the same result if 
the input data are transformed in such a way that the marginal order is pre¬ 
served. In particular, any marginally increasing transform or any scaling as 
a preprocessing step does not affect the algorithm. It also implies invariance 
with respect to any change in the measuring units. This invariance prop¬ 
erty constitutes part of the strengh of the algorithm, since data preprocessing 
steps usually have a great impact on the overall performance and are of major 
concern in pratice. 


5. Experimental results 

5.1. Recovering the support of the dependence structure of generated data 

Datasets of size 50000 (respectively 100000, 150000) are generated in 
according to a popular multivariate extreme value model, introduced by 
), namely a multivariate asymmetric logistic distribution {Giog). 
The data have the following features: (i) they resemble ‘real life’ data, that 
is, the Xl’s are non zero and the transformed Vfs belong to the interior 
cone (ii) the associated (asymptotic) exponent measure concentrates 

on K disjoint cones {Ca„,l < m < K}. For the sake of reproducibility. 


Tawn (1990 
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Giogi^) = exp{-^^^^ "*}> where \A{j)\ is the 

cardinal of the set {a E D : j E a} and where Wa^ = 0.1 is a dependence 
parameter (strong dependence). The data are simulated using Algorithm 2.2 


m 


Stephenson (2003). The subset of sub-cones D charged by /i is randomly 


chosen (for each fixed number of sub-cones K) and the purpose is to recover D 
by Algorithm [T| For each K, 100 experiments are made and we consider the 
number of ‘errors’, that is, the number of non-recovered or false-discovered 
sub-cones. Table shows the averaged numbers of errors among the 100 
experiments. The results are very promising in situations where the number 


sub-cones K 

3 

5 

10 

15 

20 

25 

30 

35 

40 

45 

50 

Aver. errors 

(n=5e4) 

0.02 

0.65 

0.95 

0.45 

0.49 

1.35 

4.19 

8.9 

15.46 

19.92 

18.99 

Aver. errors 

(n=10e4) 

0.00 

0.45 

0.36 

0.21 

0.13 

0.43 

0.38 

0.55 

1.91 

1.67 

2.37 

Aver. errors 

(n=15e4) 

0.00 

0.34 

0.47 

0.00 

0.02 

0.13 

0.13 

0.31 

0.39 

0.59 

1.77 


Table 1: Support recovering on simulated data 


of sub-cones is moderate w.r.t. the number of observations. 

5.2. Sparse structure of extremes (wave data) 

Our goal is here to verify that the two expected phenomena mentioned in 
the introduction, 1 - sparse dependence structure of extremes (small number 
of sub-cones with non zero mass), 2 - low dimension of the sub-cones with 
non-zero mass, do occur with real data. We consider wave directions data 
provided by Shell, which consist of 58585 measurements Hj, i < 58595 of 
wave directions between O'’ and 360'’ at 50 different locations (buoys in North 
sea). The dimension is thus 50. The angle 90° being fairly rare, we work 
with data obtained as X) = 1/(10“^° -|- |90 — Dl\), where Dl is the wave 
direction at buoy j, time i. Thus, H^’s close to 90 correspond to extreme 
X/’s. Results in Table [^how that the number of sub-cones Ca identified by 
Algorithm is indeed small compared to the total number of sub-cones (2^°- 
1). (Phenomenon 1 in the introduction section). Further, the dimension of 
these sub-cones is essentially moderate (Phenomenon 2): respectively 93%, 
98.6% and 99.6% of the mass is affected to sub-cones of dimension no greater 
than 10, 15 and 20 respectively (to be compared with d = 50). Histograms 
displaying the mass repartition produced by Algorithm are given in Fig. 
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dimensional repartition - non extreme data 
(beiow threshold, infinity norm) 


face dimension 


dimensional repartition - extreme data 



face dimension 


Figure 5: sub-cone dimensions of wave data 



non-extreme data extreme data 

nb of sub-cones with mass > 0 (p = 0) 
idem after thresholding {p = 0.1) 
idem after thresholding {p = 0.2) 

3413 858 

2 64 

1 18 


Table 2: Total number of sub-cones of wave data 


5.3. Application to Anomaly Detection on real-world data sets 

The main purpose of Algorithm is to build a ‘normal prohle’ for ex¬ 
treme data, so as to distinguish between normal and ab-normal extremes. 
In this section we evaluate its performance and compare it with that of a 
standard AD algorithm, the Isolation Forest (iForest) algorithm, which we 
chose in view of its established high performance (Liu et ah (2008)). The two 
algorithms are trained and tested on the same datasets, the test set being 
restricted to an extreme region. Five reference AD datasets are considered: 
shuttle, forestcover, http, SF and SA The experiments are performed in a 
semi-supervised framework (the training set consists of normal data). 

The shuttle dataset is the fusion of the training and testing datasets 
available in the UCI repository Lichman (2013). The data have 9 numerical 
attributes, the hrst one being time. Labels from 7 different classes are also 
available. Class 1 instances are considered as normal, the others as anoma¬ 
lies. We use instances from all different classes but class 4, which yields an 
anomaly ratio (class 1) of 7.17%. 


^These datasets are available for instance on http://scikit-learn.org/dev/ 
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In the forestcover data, also available at UCI repository (Lichman ( 2013| )), 
the normal data are the instances from class 2 while instances from class 4 
are anomalies, other classes are omitted, so that the anomaly ratio for this 
dataset is 0.9%. 


The last three datasets belong to the KDD Cup ’99 dataset (KDDCup 


(1999), Tavallaee et ah (2009)), produced by processing the tcpdump portions 
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, 
created by MIT Lincoln Lab Lippmann et al. ( |2000 ). The artificial data 
was generated using a closed network and a wide variety of hand-injected 
attacks (anomalies) to produce a large number of different types of attack 
with normal activity in the background. Since the original demonstrative 
purpose of the dataset concerns supervised AD, the anomaly rate is very 
high (80%), which is unrealistic in practice, and inappropriate for evaluating 
the performance on realistic data. We thus take standard pre-processing 
steps in order to work with smaller anomaly rates. For datasets SF and http 
we proceed as described in Yamanishi et al. (2000): SF is obtained by picking 
up the data with positive logged-in attribute, and focusing on the intrusion 
attack, which gives an anomaly proportion of 0.48%. The dataset http is a 
subset of SF corresponding to a third feature equal to ’http’. Finally, the 
SA dataset is obtained as in Eskin et al. (2002) by selecting all the normal 
data, together with a small proportion (1%) of anomalies. 

Table [3] summarizes the characteristics of these datasets. The thresh¬ 
olding parameter p is hxed to 0.1, the averaged mass of the non-empty 
sub-cones, while the parameters (/c, e) are standardly chosen as (n^/^,0.01). 
The extreme region on which the evaluation step is performed is chosen as 
{x : ||T(x)|| > \/n}, where n is the training set’s sample size. The ROC and 
PR curves are computed using only observations in the extreme region. This 
provides a precise evaluation of the two AD methods on extreme data. For 
each of them, 20 experiments on random training and testing datasets are 
performed, yielding averaged ROC and Precision-Recall curves whose AUC 
are presented in Table DAMEX signihcantly improves the performance 
(both in term of precision and of ROC curves) in extreme regions for each 
dataset, as illustrated in hgures and 

In Table we repeat the same experiments but with e = 0.1. This yields 
the same strong performance of DAMEX, excepting for SF. Generally, to 
large e may yield over-estimated AA{a) for low-dimensional faces a. Such 
a performance gap between e = 0.01 and e = 0.1 can also be explained 
by the fact that anomalies may form a cluster which is wrongly include in 
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some over-estimated ‘normal’ sub-cone, when e is too large. Such singular 
anomaly structure would also explain the counter performance of iForest on 
this dataset. 

We also point out that for very small values of epsilon (e < 0.001), the 
performance of DAMEX signihcantly decreases on these datasets. With such 
a small e, most observations belong to the central cone (the one of dimension 
d) which is widely over-estimated, while the other cones are under-estimated. 

The only case were using very small e should be useful, is when the 
asymptotic behaviour is clearly reached at level k (usually for very large 
threshold n/fc, e.g. k = or in the specific case where anomalies clearly 

concentrate in low dimensional sub-cones: The use of a small e precisely 
allows to assign a high abnormality score to these subcones (under-estimation 
of the asymptotic mass), which yields better performances. 

The averaged ROC curves and PR curves for the other datasets are gath¬ 
ered in [Appendix B 



shuttle 

forestcover 

SA 

SF 

http 

Samples total 

85849 

286048 

976158 

699691 

619052 

Number of features 

9 

54 

41 

4 

3 

Percentage of anomalies 

7.17 

0.96 

0.35 

0.48 

0.39 


Table 3: Datasets characteristics 


Dataset 

iForest 

AUC ROC AUC PR 

DAMEX 

AUC ROC AUC PR 

shuttle 

0.957 

0.987 

0.988 

0.996 

forestcover 

0.667 

0.201 

0.976 

0.805 

http 

0.561 

0.321 

0.981 

0.742 

SF 

0.134 

0.189 

0.988 

0.973 

SA 

0.932 

0.625 

0.945 

0.818 


Table 4: Results on extreme regions with standard parameters (fc, e) = 
(n^/^0.01) 

Considering the significant performance improvements on extreme data, 
DAMEX may be combined with any standard AD algorithm to handle ex¬ 
treme and non-extreme data. This would improve the global performance 
of the chosen standard algorithm, and in particular decrease the false alarm 
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Dataset 

iForest 

AUC ROC AUC PR 

DAMEX 

AUC ROC AUC PR 

shuttle 

0.957 

0.987 

0.980 

0.995 

forest cover 

0.667 

0.201 

0.984 

0.852 

http 

0.561 

0.321 

0.971 

0.639 

SF 

0.134 

0.189 

0.101 

0.211 

SA 

0.932 

0.625 

0.964 

0.848 


Table 5: Results on extreme regions with lower e = 0.1 


rate (increase the slope of the ROC curve’s tangents near the origin). This 
combination can be done by splitting the input space between an extreme 
region and a non-extreme one, then using Algorithm to treat new observa¬ 
tions that appear in the extreme region, and the standard algorithm to deal 
with those which appear in the non-extreme region. 


6. Conclusion 


The contribution of this work is twofold. First, it brings advances in 
multivariate EVT by designing a statistical method that possibly exhibits a 
sparsity pattern in the dependence structure of extremes, while deriving non- 
asymptotic bounds to assess the accuracy of the estimation procedure. Our 
method is intended to be used as a preprocessing step to scale up multivari¬ 
ate extreme values modeling to high dimensional settings, which is currently 
one of the major challenges in multivariate EVT. Since the asymptotic bias 


(bias(a, n, k, e) in eq. (3.21)) appears as a separate term in the bound estab¬ 
lished, no second order assumption is required. One possible line of further 
research would be to make such an assumption (he.to assume that the bias 
itself is regularly varying), in order to choose e adaptively with respect to k 
and n (see Remark]^. This might also open up the possibility of de-biasing 
the estimation procedure (Fougeres et al. (2015), Beirlant et ah (2015)). 


As a second contribution, this work extends the applicability of multivariate 
EVT to the held of Anomaly Detection: a multivariate EVT-based algorithm 
which scores extreme observations according to their degree of abnormality is 
proposed. Due to its moderate complexity -of order dn log n- this algorithm 
is suitable for the treatment of real word large-scale learning problems, and 
experimental results reveal a signihcantly increased performance on extreme 
regions compared with standard AD approaches. 
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ROC curve for shuttle 


t=0.01, 


PR curve for shuttle 




Figure 6: SF dataset, default parameters 


Acknowledgements 

Part of this work has been supported by the industrial chair ‘Machine 
Learning for Big Data’ from Telecom ParisTech, by the Ecole Normale Superieure 
de Cachan and by the AGREED project from PEPS JCJC INS2I 2015. 


Appendix A. Technical proofs 

Appendix A.l. Proof of Lemma 

For n vectors Vi,..., in let us denote by rank{vj) the rank of 
vj among that is rank(y{) = ZlLi = 

{rank{Xl) — l)/n. For the first equivalence, notice that Vf = l/Ul- For the 
others, we have both at the same time: 


n 

Vi > 


rankiXl) — 1 k 

^ 1 -- < - X- 

n n 

■v^ rank{Xl) > n — kxj^ + 1 
rank{Xi) > n — [kxj^\ + 1 
^ X- > 
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ROC curve for forestcover 


t=0.01, 


PR curve for forestcover 




Figure 7: SF dataset, larger e 


and 


XI > ^ rank{Xl) >n - [kx^ + 1 


rank{Fj{Xl)) > n — [kxj J + 1 (with probability one) 
rank{l — Fj{Xi)) < \_kx~^\ 

^ Ul< f//,. 

* {\kx. J) 


Appendix A. 2. Proof of Lemma 

First, recall that ( 7 q^^(x,z) = /r(i?(x“^, z“^, a,/3)), see (3.13). Denote by 
71 the transformation to pseudo-polar coordinates introduced in Section 


TT : [0, cxd]^ \ {0} —>• (0, cxd] X S[ 


d-l 


V ^ (r,6>) = (||v||oo, i|v|| 


viloo'v) 


Then, we have d(po7r = ^d<F on (0, oo] x This classical result from 

EVT comes from the fact that, for rg > 0 and B C pi o 7r“^{r > rg, 0 G 
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B] = Tq ^$(5), see (2.6). Then 


5 fa,/ 3 (x, z) = /i O TT ^|(r, 0) : Vi e a, > a:/ ; Vj G 13, r9j < Zj ^ 


/iOTT ^|(r,0) : r>\J{eiXi) r < l^iOjZj) 

i£a j^R 


-1 


'ees^-^ Jr>0 


i6/3 

dr 


'0eS“ 


'0eS“ 


((V(»«)^‘)''“ (A(''ry)“‘)'‘) d 4 (e) 

V *e« j&i3 / 

^ ( /\ OiXi - \J OjZj j d$(0), 


\i&a je/3 


which proves the hrst assertion. To prove the Lipschitz property, notice hrst 
that, for any hnite sequence of real numbers c and d, max* Cj — max* di < 
maxj(cj — di) and miUjCj — min* d, < maxj(ci — di). Thus for every x, z G 
[0, cxd]'^ \ {oo} and 6 G 



< 

< 

< 

< 



A - A + V “ V 

J£a j£a jei3 


meix{9jXj — 9jx)) + max(6',z' — 9jZj) 
j£a je/3 •' 

maxddx,- — a;'-1 + max dob' — Zj\ 

j&a ^ jef} ^ 


Hence, 

Ida,/3(X,Z) -^o,;3(x',z')| 


< 



( max do bo — x’, 

\ 3&a ^ 


+ 


max dob' — ^o| I d<h(0) . 

i6/3 ^ ■'7 
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Now, by (2.7) we have: 


/ max 0,1 a:,- 

/ed-l j£a 

'Joo 


X, 


d<h(0) = /i([0, X 




with X dehned as Xj = \xj — x' | for j G a, and 0 elsewhere. It suffices then 
to write: 


]") = 3j e a, Vj > \xj - x'A }) 


< 


jea 


Vj >\xj - xA }) 


jea 




Similarly, Jgd-i maxjg^0j|^' — Zj\ d$(0) < E, 

Appendix A. 3. Proof of Proposition 

The starting point is inequality (9) on p.7 in Goix et al. (2015) which 
bounds the deviation of the empirical measure on extreme regions. Let 
Cn(-) = ^Er=i^{Zie-} and C(x) = P(Z G ■) be the empirical and true 
measures associated with a n-sample Zi, ..., Z^^ of i.i.d. realizations of a 
random vector Z = {Z^, ..., with uniform margins on [0,1]. Then for 
any real number 6 > e~^, with probability greater than 1 — 5, 


n 

sup - 
o<x<r fC 


Cn(- X, OO ) -C(- X, OO 

n n 


IT 1 
< log-. 


(A.1) 


Recall that with the above notations, 0 < x < T means 0 < Xj < T for every 
j. The proof of Proposition [^follows the same lines as in Goix et al. (2015). 
The cornerstone concentration inequality (A.l) has to be replaced with 


n 

max sup — 

0<x,z<T k 
3jGa,Xj <T' 


c 


k , _i 
-R X , 

n 


■\a, (3) 


-1 


C 


k , 

-R X , 
n 


■\a, 13 ) 


-1 




(A.2) 


Remark 10 . Inequality (A.2) is here written in its full generality, namely 


with a separate constant T' possibly smaller than T. IfT' < T, we then have 
a smaller hound (typically, we may use T = 1/e and T' = 1). However, we 


only use (A.2) with T = T' in the analysis below, since the smaller bounds 


in T' obtained (on A(n) in (A.5)j would be diluted (by T(n) in (A.5)j. 
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Proof of (A.2). Recall that for notational convenience we write ‘a,/3’ for ‘a 
non-empty subset of {1,..., d} and [3 subset of {1,..., d}’. The key is to ap¬ 
ply Theorem 1 in Goix et ah (2015), with a VC-class which hts our purposes. 
Namely, consider 


jA — -At^t' — 

A — ^ 

•AT,T',a,l3 — — 

n 


\^AT,T',a,l3 

a,fi 


with 


13) 


-1 . 


X, z G Mf, 0 < X, z < T, 

3j G a,Xj < T'l , 


for T, T' > 0 and a, /d C {1,..., d}, a ^ A has VC-dimension V 4 = d, 
as the one considered in Goix et ah] (|2015). Recall in view of (3.10) that 


R(x ,z , a, P) = \y e [0, 00 ] , Vj < Xj for j G a, 


Vj > for i e /3 


= [a,b]. 


with a and b dehned by aj = 


0 for j G a 
Zj for j G /d 


and bj = 


Xj for j G a 
00 for j G /d 


Since we have 'iA G A C [|T', oo['^, the probability for a r.v. Z with 
uniform margins in [0,1] to be in the union class A = ^ ^ A) < 

P(Z G [^T', cx3['=) < < ^dV Ine quality ( [X^ is thus a 

direct consequence of Theorem 1 in Goix et ah (2015). □ 

Dehne now the empirical version An,o,^ of (introduced in (3.12)) as 


F 


n,a,0 


= -El 


{Uf<Xj for j£a and U^>Zj for jGjS} ’ 


(A.3) 


i=l 


for jGa and Uf> — Zj for jE/3} 


.No- 


sothat fT;,„,^(^x,^z) = I Er=i 

t 7 — n ^ 

tice that the t//’s are not observable (since Fj is unknown). In fact, Fn,a,is 
will be used as a substitute for gn,a ,0 (dehned in (3.14)) allowing to handle 
uniform variables. This is illustrated by the following lemmas. 
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Lemma 6 (Link between gn,a,i 3 and Fn^a,g)- The empirical version of 
and that of are related via 


gn,a,g{i^y z) 


-F 

^^n,a,/3 


tf 


j&a 



Proof. Considering the definition in (A.3) and (3.15), both sides are eqnal to 


Lemma 7 (Uniform bound on Fn^a,/sF deviations). For any finite T > 0, 
and 6 > , with probability at least 1 — 5, the deviation of Fn,a,y from Fa^f^ 

is uniformly bounded: 


max sup 

o<x,z<r 


n ~ k k n ~ k k 

TTn,a,/s{-^, -z) - -F„,/ 3 (-X, -Z, 
k n n k n n 


log). 


Proof. Notice that 


sup 

0<x,z<T 


n~ kkn~kk 
7 ^n,a,/3(-X, -z) - -^^^(-X, -z) 
k n n k n n 


n 

= sup - 

o<x,z<r ^ 


1 


n 


El 

2=1 


U,e|i?(x-l,z-l,a, /?)- 


P 


U G —Riyi z , a, j3) 
n 


-1 


and apply inequality (A.2) with T' = T. 


□ 


Remark 11. Note that the following stronger inequality holds true, when 
using (A. 2) in full generality, i.e. with T' < T. For any finite T,T' > 0, and 
6 > e~^, with probability at least 1 — 5, 


max sup 

o<x,z<r 

3j^a..,Xj<T' 


n ~ k k . 

~rFyi a X, Z 

k n n 


IhA 


k 

-X, 

n 



< Cd\l'^ log 


1 

5‘ 


The following lemma is stated and proved in Goix et ah (2015). 


Lemma 8 (Bound on the order statistics of U). Let 6 > e~^. For any 
finite positive number T > 0 such that T > 7/2((logd)//c + 1), we have with 
probability greater than 1 — 5, 


71 

V 1 < j < d, < 2T , (A.4) 
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and with probability greater than 1 — {d+ 1)5, 


[kxA 

max sup — 

0<Xj<T k 


n 

k 


IP 


< c 



1 

5 ■ 


We may now proceed with the proof of Proposition Using Lemma we 
may write: 


max sup z) - z) | 

o<x,z<r 


= max sup 

0<x,z<T 


- F 

. J- r> 


ul 


< A(n) + S(n) + T(n) . 


jea 


. h, 


ilkzjl) 


ie/3 


- W,/3(x,z) 

(A.5) 


with: 


A(n) = max sup 
“A o<x,z<r 




J6/3 


--F 

- -»■ rv 




([kzj\) 




[n) = max sup 
“A o<x,z<r 


T (n) = max sup 
“A o<x,z<r 




([fcZjJ) 


16/3 




9«./S ( ( fchlfajJ) 


n. 


- W,/3(x,z) 


Now, considering (A.4) we have with probability greater than 1 — 5 that for 
every 1 < j < d, , so that 


A(n) < max sup 

“A 0<x,z<2T 


n ~ f k k \ n ~ {k k 

TPn,a,g -X, -z - yFo^^g -X, -Z 
k \n n j k \n n 


Thus by Lemma with probability at least 1 — 25, 


/2T 1 

A(n) < Cd\l—log-. 
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Concerning T(n), we have the following decomposition: 


T (n) < max sup 

0<x,z<T 


n 


n 


A: 

[kxj\ \ ([kzj\ 


9a,0 


k ) \ k , 

jGa \ / je0 , 


max sup 

0<x,z<T 


9a,0 

=: Ti(n) + T 2 (n) . 


[kxj\ \ [ [kzj\ 


k 


jea 


k 


-^a,/3lX,Zj 




The inequality in Lemma allows us to bound the first term Ti(n): 


Ti (n) < C max sup 

“>/3 o<x,z<r ^ 
- - j&a 


[kxj} _ n^j 


k k 


j&0 


[kzj\ 


k k 


< 2C sup 

l<3<d 


k k 


SO that by Lemma with probability greater than 1 — (d + 1)(5: 


2T 1 

Ti(n) < Cd\l—log- 


Similarly, 


^ 2 {n) < 2C sup 


0<x<T 


[kXj\ 

k ^ 


< c^- 

k 


Finally we get, for every n > 0, with probability at least 1 — (d + 3)5, 


max sup | 5 („,„,^(x,z) - 5 („_^(x,z)| < A(n) + Ti(n) + T 2 (n) + S(n) 

“>P 0<x,z<T 


'2T 1 2d 

< Cd\l — log - + — + max sup 

k 0 k a,0 0<x.z<2T 


n ~ k k 

^i^a,/3(-X,-z) -^„,^(x,z) 


2T 1 

< C'd\ ——log- + max sup 

k 6 a,0 o<x,z<2r 


n~kk 

^i^a,/3(-X, -z) -^„,^(x,z) 
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Remark 12. (Bias termJ It is classical (see Qi (1991) p-174 for details) 
to extend the simple convergence (3.11) to the uniform version on [0,T]'^. It 
suffices to subdivide [0,T]'^ and to use the monotonicity in each dimension 
coordinate of ga^y and Fa,y- Thus, 


sup 

0<x,z<2T 


n ~ k k 


^ 0 


for every a and fl. Note also that by taking a maximum on a finite class we 
have the convergence of the maximum uniform bias to O; 


max sup 

0<x,z<2T 


n~kk 

-z) 

k n n 


^ 0 . 


(A.6) 


Appendix A.f. Proof of Lemma 

First note that as the B^’s form a partition of the simplex and that 
n 12/3 = 0 as soon as a (fi (3, we have 

y yDa 

Let us recall that as stated in Lemma |^, $ is concentrated on the (dis¬ 
joint) edges 

= {x : ||x||oo = 1, Xjg = 1, 0 < Xj < 1 for z e a \ {zq} 

Xi = 0 for z ^ a } 


and that the restriction of <F to is absolutely continuous w.r.t. the 
Lebesgue measure dxQ,\jQ on the cube’s edges, whenever |a| > 2. By (2.15) 
we have, for every /3 D a, 


= Y. f , 

ioS/S 

4(SJ.) = Y [ 


d<h 


yio 


dx 


hVo 


x) dx^\ig 


lo&a 


KlXfy\ ir\ 
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Thus, 


mr') - m.) = E E 1.,, 


d$/3,io 


/3Da*oe/3 dx^Vo 

- E 


x) dxp\i^ 


d<h 


a,io 


ioScf 


dx, 


a\io 


x) dx«\i(, 


d® 


EE / 

l3Da ioe/3 

- E 


x) dx^vo 


d$, 


0,20 


ZqGCK rificK^iQ) ^^a\20 


x) cixQ,'y2Q, 


so that by ec 2.16 


|4(nS-')-4>(Si«)l < E^-fflE / , 

+ Ma y~^ 


/3Do ioS/3 


ioe« ''^a,iQ 


(A.7) 

dxQ,\jQ . 


Without loss of generality we may assume that a = {1, ■■■,K} with K < d. 
Then, for /3 ^ a, dxmj„ is smaller than (e')l/^l“l"l and is null as 

soon as io ^ /d \ «• To see this, assume for instance that (3 = {1,..., P} with 
P > K. Then 

ny n hl/3,io = {e < Xi,..., Xx < 1, xk+1, ..., Xp < e', Xj^ = 1, 

xp+i = ... = Xrf = 0 } 

which is empty ii iQ> K + 1 {i.e. io & f3 \ a) and which fulhlls if zq E 77 

[ , dx^vo < 


The hrst term in (A.7) is then bounded by Tfa|Q!|(e')l^l Now, con¬ 
cerning the second term in (A.7), fl ^a,io = E l,a^io ~ 

1 , xk+ 1 , •••, Xd = 0 } and then 

^«,io \ ^a,io) = IJ ^a,io n {x; < c}, 

1=1,...,K 
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so that ~ second term in (A.7) is 


thus bounded by M|ape. Finally, ( |A.7[ ) implies 

+M|a| 2 e. 

/3Da 

To conclude, observe that by Assumption 

Mp < e'M 

/32« /32a |/3|>2 


The result is thus proved. 

Appendix A. 5. Proof of Remark 

Let us prove that conditionally to the event {||^Vi||oo > 1}, converges 
in law. Recall that is a {2^ — l)-vector dehned by Zn{a) = for 

all a C {!,..., d}, a ^ 0. Let us denote = (lj=o)j=i,..., 2 d-i where we 
implicitely dehne the bijection between 'P({ 1 ,..., d}) \ 0 and { 1 ,..., 2 '^ — 
1}. Since the R^’s, a varying, form a partition of [0,1]'^, P(3a, = 

la I II^Villoo > 1) = 1 and ^n(a) = 1 ^Vi G so 


‘-Q I 

that 


E 


d)(Z„)l||fe 


Vl ||oo>l 


5;<l.(UP(Z„(a) = l). 


Let <F : 


—)■ M-l be a measurable function. Then 




,, k 


k 

-1 

r -| 

E 

_1 

-Vl oo>l 
n 

= P 

-Vl oo>l 
n 

E 

$(2'„)l|l|Vi||„,>i 


Now, P [II^Villoo > l] = with 7 r„ -)■ /x([0, l]'^), so that 


E 


I ||-Vi||oo>l 

n 




Using IV |Z„(a) = 1] = f P [|V, 6 fij] ^ , we find that 


E 


d)(Z„) I ll-Villoo > 1 

n 






r{K) 


which achieves the proof. 
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True Positive Rate True Positive Rate 


Appendix B. Experiments curves 


= 0 . 01 , k=n^/'‘ 


ROC curve for SA 




Figure B.8: SA dataset, default parameters 


e =0.01, k 


ROC curve for SF 



PR curve for SF 



Figure B.9: shuttle dataset, default parameters 
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t=0.01, 




Figure B.IO; http dataset, default parameters 
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